The objective is to assess the feasibility of predicting the likelihood of infrastructure project delivery using data from the Pipeline dataset and easily accessible external sources.
The Pipeline dataset contains information on 5,940 infrastructure projects, including project status, funding status, procurement methods, timelines, and geographic details (latitude, longitude).
The brute force approach involved creating a weighting system to assign likelihoods based on project sector, region, funding source, and duration spent in specific phases. However, the dataset lacks crucial information, such as how long a project remains in specific statuses (e.g., “On hold,” “In Development”) or reasons behind the statuses. These factors would have made the modeling more straightforward.
Although status durations can be extracted manually, there were too few data points to model the ‘likelihood’ of completion directly. We therefore took an alternative approach: predicting total delay as a proxy for the likelihood of project delivery.
To support this, several features were engineered from the Pipeline data:
• complete: Indicates whether the project is complete.
• planned_duration: Planned project duration.
• actual_duration: Actual duration of the project.
• budget_min and budget_max: Estimated budget range.
• funding_status_indicator: Whether the funding source is confirmed.
• project_status_indicator: Indicates whether the project is “In planning.”
• total_delay: Calculated as the difference between the actual completion date and the estimated completion date for completed projects.
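The duration and delay features can be derived with plain date arithmetic. The following is a minimal sketch on toy data, not the code used for the report: the column names start_date, estimated_completion and actual_completion are hypothetical, and measuring durations in days is an assumption.

```r
# Toy data; the three date column names are hypothetical stand-ins
# for the corresponding Pipeline fields.
projects <- data.frame(
  PrimaryKey           = c("A", "B", "C"),
  start_date           = as.Date(c("2020-01-01", "2021-03-01", "2022-06-01")),
  estimated_completion = as.Date(c("2021-01-01", "2022-03-01", "2024-06-01")),
  actual_completion    = as.Date(c("2021-04-01", "2022-02-01", NA))
)

# A project counts as complete once it has an actual completion date.
projects$complete <- !is.na(projects$actual_completion)

# Subtracting Date objects yields a difftime in days; convert to plain numbers.
projects$planned_duration <- as.numeric(projects$estimated_completion - projects$start_date)
projects$actual_duration  <- as.numeric(projects$actual_completion - projects$start_date)

# Positive values mean late delivery, negative values mean early delivery.
# Incomplete projects get NA, so they drop out of any model fitted on total_delay.
projects$total_delay <- as.numeric(projects$actual_completion - projects$estimated_completion)
```

Only rows with complete set to TRUE carry a usable total_delay, which is why the delay model can be trained on completed projects only.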
Predicting total delay allows us to estimate the likelihood of project delivery, with higher delays implying lower chances of on-time completion. The predicted delays were then scaled to a likelihood percentage (0-100%) for each project region and sector, normalizing the values to assess the probability of on-time delivery.
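The report does not state the exact scaling formula, so the following is only one plausible reading: an inverted min-max scaling applied within each region and sector group, so that the largest predicted delay in a group maps to 0% and the smallest to 100%. The column name ProjectSector and all values are hypothetical.

```r
# Illustrative predictions only; these are not values from the Pipeline data.
preds <- data.frame(
  ProjectRegion   = c("Waikato", "Waikato", "Waikato", "Auckland", "Auckland"),
  ProjectSector   = c("Transport", "Transport", "Transport", "Water", "Water"),
  predicted_delay = c(0, 50, 200, -10, 100)
)

# Assumed scaling: inverted min-max, so a higher delay gives a lower likelihood.
scale_likelihood <- function(delay) {
  rng <- max(delay) - min(delay)
  if (rng == 0) return(rep(100, length(delay)))  # a group with no spread
  100 * (max(delay) - delay) / rng
}

# Build one grouping key per region/sector pair and scale within each group.
group_key <- paste(preds$ProjectRegion, preds$ProjectSector, sep = " / ")
preds$likelihood <- ave(preds$predicted_delay, group_key, FUN = scale_likelihood)
```

A scaling of this kind is sensitive to outliers: a single extreme predicted delay stretches the range and compresses every other project in the group toward one end of the scale.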
The output below shows a snapshot of the predicted delay and likelihood of project delivery from linear model 1 (LM1):
##   PrimaryKey                                                ProjectName
## 1   IP008800 Waikuku Beach Kings Ave Wastewater Rising Main Replacement
## 2   IP011147                              Unsealed pavement maintenance
## 3   IP012589                           Joyce Rd WSTP Technology Renewal
## 4   IP013179                  Glenshea Water Supply - Reservoir Repairs
##   ProjectRegion predicted_delay likelihood
## 1    Canterbury      -105.93118 92.1829843
## 2 Bay of Plenty     -3411.61797  0.5180428
## 3 Bay of Plenty      -374.90774 88.6389176
## 4       Waikato        38.35062 82.8519911
The output below shows a snapshot of the predicted delay and likelihood of project delivery from linear model 2 (LM2):
##                                             ProjectName ProjectRegion
## 1 1495_001 - Clarke St Landscaping and Car Park Closure       Waikato
## 2                                Structures maintenance Bay of Plenty
## 3              North Western Bus Improvements(NWBI)(ST)      Auckland
## 4                               Unsealed Road Metalling Bay of Plenty
##   EstimatedQuarterProjectRangeCompletion predicted_delay likelihood
## 1                             2025-04-01       52.429853   48.71237
## 2                             2054-07-01    -2724.841818    0.00000
## 3                             2024-10-01     -183.471795   46.26339
## 4                             2025-04-01       -0.329693   54.92067
##         ProjectStatus
## 1      Early planning
## 2         In planning
## 3 Post implementation
## 4         In planning
The graph below shows the distribution of expected delays across regions for the planned projects. The fit for LM1 was not strong enough to use in production or for decision-making.
The graph below shows the distribution of the likelihood of project delivery across regions for the planned projects. The likelihood values are scattered widely and fail to account for extreme cases.
The graph below shows the distribution of expected delays across regions for the planned projects in the Transport sector. The fit for LM2 was better than that for LM1 and strong enough to use in production or for decision-making.
The graph below shows the distribution of the likelihood of project delivery for the planned projects in the Transport sector.
The final input variables are the average delay in the sector, minimum budget, maximum budget, planned duration, funding status (confirmed or not), project status, project region, GDP delta and CPI delta. The model explains roughly 80% of the variance in total delay (multiple R-squared 0.8086, adjusted R-squared 0.791).
summary(delay_model_2)
##
## Call:
## lm(formula = total_delay ~ avg_group_delay + budget_min + budget_max +
##     planned_duration + funding_status_indicator + project_status_indicator +
##     ProjectRegionCode + CPI_Delta_Percent + GDP_Delta_Percent,
##     data = completed_projects)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1804.80   -94.91    23.48   139.00  1268.37
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)              175.63411  111.73779   1.572   0.1192
## avg_group_delay            0.73889    0.09267   7.973 2.89e-12 ***
## budget_min                 6.20271   13.25293   0.468   0.6408
## budget_max                -1.23297    5.58931  -0.221   0.8259
## planned_duration          -0.25993    0.10248  -2.536   0.0128 *
## funding_status_indicator  91.92554   95.56151   0.962   0.3384
## project_status_indicator -32.35917  122.96808  -0.263   0.7930
## ProjectRegionCode          0.73967    9.60861   0.077   0.9388
## CPI_Delta_Percent          0.08534    0.03967   2.152   0.0339 *
## GDP_Delta_Percent          5.35848   10.89189   0.492   0.6238
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 342.8 on 98 degrees of freedom
## Multiple R-squared:  0.8086, Adjusted R-squared:  0.791
## F-statistic: 45.99 on 9 and 98 DF,  p-value: < 2.2e-16
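A model with the same formula as in the summary above can be fitted and used for prediction along the following lines. This is a sketch on synthetic data: every value is randomly generated, the name delay_model_sketch is hypothetical, and the result will not reproduce the coefficients reported above.

```r
set.seed(42)  # reproducible synthetic data
n <- 120

# Synthetic stand-in for completed_projects; all values are made up.
completed_projects <- data.frame(
  avg_group_delay          = rnorm(n, mean = 100, sd = 300),
  budget_min               = runif(n, 1, 50),
  planned_duration         = runif(n, 90, 2000),
  funding_status_indicator = rbinom(n, 1, 0.7),
  project_status_indicator = rbinom(n, 1, 0.3),
  ProjectRegionCode        = sample(1:16, n, replace = TRUE),
  CPI_Delta_Percent        = rnorm(n, mean = 5, sd = 3),
  GDP_Delta_Percent        = rnorm(n, mean = 2, sd = 1)
)
completed_projects$budget_max <- completed_projects$budget_min + runif(n, 1, 100)

# Arbitrary data-generating process, chosen only so the fit has signal to find.
completed_projects$total_delay <- 0.7 * completed_projects$avg_group_delay -
  0.25 * completed_projects$planned_duration + rnorm(n, sd = 300)

# Same formula as in the reported summary.
delay_model_sketch <- lm(
  total_delay ~ avg_group_delay + budget_min + budget_max + planned_duration +
    funding_status_indicator + project_status_indicator + ProjectRegionCode +
    CPI_Delta_Percent + GDP_Delta_Percent,
  data = completed_projects
)

# Share of variance explained, and predictions for a few rows.
r_squared       <- summary(delay_model_sketch)$r.squared
predicted_delay <- predict(delay_model_sketch, newdata = completed_projects[1:3, ])
```

In the reported summary, ProjectRegionCode has a single coefficient, which means it enters the model as a numeric term. Wrapping it in factor() would instead estimate a separate effect per region, at the cost of more parameters on a small sample.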
Histogram of residuals, used to check whether the residuals are approximately normally distributed.
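A residual check of this kind can be sketched as follows, on a small synthetic model rather than on delay_model_2. The normal Q-Q computation is an addition for illustration and is not mentioned in the report.

```r
set.seed(1)  # reproducible synthetic data

# Synthetic linear relationship with normally distributed noise.
x <- runif(100, min = 0, max = 10)
y <- 2 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

res <- residuals(fit)

# Bin the residuals; plot = FALSE returns the bins without drawing.
# Call hist(res, breaks = 20) instead to draw the histogram.
h <- hist(res, breaks = 20, plot = FALSE)

# Theoretical vs. sample quantiles for a normal Q-Q comparison;
# call qqnorm(res) followed by qqline(res) to draw it.
qq <- qqnorm(res, plot.it = FALSE)
```

A roughly symmetric, bell-shaped histogram and Q-Q points close to a straight line are consistent with normal residuals. Long tails, such as the minimum of -1804.80 and maximum of 1268.37 in the summary above, would show up as points bending away from the line at the ends.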